In this blog, I will explain to the you how one might use MLE and MM to model
(a) Glycohemoglobin
(b) Height of adult females.
The data will be from National Health and Nutrition Examination Survey 2009-2010 (NHANES), available from the Hmisc package. I will compare and contrast the two methods in addition to comparing and contrasting the choice of underlying distribution.
To be specific, I will show how I calculated estimates of parameters and provide visuals that show the estimated distribution compared to the empirical distribution (comment on the quality of the fit): Overlay estimated pdf onto histogram Overlay estimated CDF onto eCDF QQ plot (sample vs estimated dist) I will also explain how I calculated the median from the estimated distribution, demonstrate how to generate the median sampling distribution, how I calculate the range of middle 95% of sampling distribution, and explain why such a quantity is important.
At the end of this blog, I will provide you with a set of “take-home messages”.
Maximum likelihood (MLE) and method of moments (MM) are two common methods for constructing a model.
First, I will import data and select the column I will use.
# require package and import data
require(dplyr)
## Loading required package: dplyr
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
Hmisc::getHdata(nhgh)
d1 <- nhgh %>%
filter(sex == "female") %>%
filter(age >= 18) %>%
select(gh, ht) %>%
filter(1:n()<=1000)
| Normal | Gamma | Weibull | |
|---|---|---|---|
| Estimates of parameters | |||
| Overlay estimated CDF onto eCDF | |||
| Overlay estimated pdf onto histogram | |||
| QQ plot (sample vs estimated dist) | |||
| Estimated Median | |||
| Median Samp Dist (hist) | |||
| Range of middle 95% of Samp Dist |
To generate the estimates of parameters of normal distribution, I will use mean() to calculate the mean value and use sd() to calculate the standard deviation.
# Estimates of parameters
mm_norm_mean=mean(d1$gh)
mm_norm_sd=sd(d1$gh)
mm_norm_mean
## [1] 5.7246
mm_norm_sd
## [1] 1.052246
Then, we could use these parameters to plot cdf and ecdf. To plot ecdf, I will use plot() and ecdf(). To draw the estimated CDF, we can use pnorm() to get the cdf and use curve() to draw its line. To overlay estimated CDF onto eCDF, we have to set add=TRUE in curve function. From the graphic, we can see the empirical CDF doesn’t match cdf well.
# Overlay estimated CDF onto eCDF
plot(ecdf(d1$gh),main = "Glycohemoglobin(MM)",ylab="CDF",xlab="GH",pch = 5,col="grey")
curve(pnorm(x, mm_norm_mean,mm_norm_sd),add=TRUE,col="pink",lwd=3)
legend("topleft",c("eCDF","estimated CDF"), col = c("grey","pink"), lwd=2,bty = "n")
Next, we will use these parameters to plot histogram and estimated PDF. To plot histogram, I will use hist(). To draw the estimated PDF, we can use dnorm() to get the pdf and use curve() to draw its line. To overlay estimated PDF onto histogram, we have to set add=TRUE in curve function. From the graphic, We can see that estimated pdf also doesn’t match the pdf well.
# Overlay estimated pdf onto histogram
hist(d1$gh,main="Glycohemoglobin(MM)",xlab="GH",breaks=100,freq=FALSE,col="white")
curve(dnorm(x,mm_norm_mean,mm_norm_sd),add=TRUE,col="pink",lwd=2)
We can also create a QQ plot using these parameters. I will first use quantile() to get sample quantile and use qnorm() to get theoretical quantile. Then, I use plot() to create the QQ plot, x-axis is sample quantile and y-axis is theoretical quantile. From the graphic, we can see that plotted points nearly do not fall along the line y=x, which means that simulated distribution does not match the theoretical sampling distribution.
# QQ plot (sample vs estimated dist)
ps<-ppoints(1000)
sample_q=quantile(d1$gh,ps)
theo_q=qnorm(ps,mean=mm_norm_mean,sd=mm_norm_sd)
plot(theo_q,sample_q,pch=10,main="QQ Plot for normal distribution",ylab="sample quantile",xlab="theoretical quantile",col="pink")
abline(0,1,col="grey")
Plot simulated data on the y axis and the theoretical data on the x aixs. The plotted points do not fall along the line y=x well. Thus, the simulated distribution seems to not agree with the theoretical sampling distribution.
Then, I will use qnorm() to generate the estimated median and make sure the argument p=0.5. The first one is analytical median, the second one is estimated median.
median(d1$gh)
## [1] 5.5
# Estimated Median
qnorm(.5,mean = mm_norm_mean,sd=mm_norm_sd)
## [1] 5.7246
I will use simulation method to generate the median sample where the sample size is 1000 and simulation times are 5000. Then, we can use hist() to draw a histogram for simulated medians.
# Median Samp Dist (hist)
N <- 1000
M <- 5000
out <- array(rnorm(M*N,mean =mm_norm_mean,sd=mm_norm_sd ), c(M,N))
sample <- apply(out,1,median)
hist(sample,xlab = "Median Sample",main="Median Samp Dist",col = "white",breaks = 50)
Finally, I will use quantile() to generate the range of middle 95% of sample distribution. We can see the results below:
# Range of middle 95% of Samp Dist
quantile(sample,c(0.05/2, 1-0.05/2))
## 2.5% 97.5%
## 5.642725 5.805665
diff(quantile(sample,c(0.05/2, 1-0.05/2)))
## 97.5%
## 0.1629399
In terms of maximum likelihood, we could see the results fit the normal distribution using MM. However, the simulated distribution seems to not agree with the theoretical distribution.
# Maximum likelihood
require(stats4)
## Loading required package: stats4
nLL <- function(mean, sd){
fs <- dnorm(
x = d1$gh
, mean = mean
, sd = sd
, log = TRUE
)
-sum(fs)
}
gh <- mle(
nLL
, start = list(mean = 1, sd = 1)
, method = "L-BFGS-B"
, lower = c(0, 0.01)
)
par(mfrow = c(1,2)); plot(profile(gh), absVal = FALSE)
par(mfrow = c(1,2))
# Overlay estimated CDF onto eCDF
plot(ecdf(d1$gh), main = "",col="pink",xlab="GH")
curve(
pnorm(x, coef(gh)[1], coef(gh)[2])
, add = TRUE
, col = "grey"
, lwd = 3
)
# Overlay estimated pdf onto histogram
hist(d1$gh, freq = FALSE, main = "",col="white",xlab="GH")
curve(
dnorm(x, coef(gh)[1], coef(gh)[2])
, add = TRUE
, col = "pink"
, lwd = 3
)
# QQ plot (sample vs estimated dist)
ps<-ppoints(1000)
sample_q=quantile(d1$gh,ps)
theo_q=qnorm(ps,mean=coef(gh)[1],sd=coef(gh)[2])
plot(theo_q,sample_q,pch=10,main="QQ Plot for normal distribution(MLE)",ylab="sample quantile",xlab="theoretical quantile",col="pink")
abline(0,1,col="grey")
median(d1$gh)
## [1] 5.5
# Estimated Median
qnorm(.5,mean = coef(gh)[1],sd=coef(gh)[2])
## [1] 5.7246
# Median Samp Dist (hist)
N <- 1000
M <- 5000
out <- array(rnorm(M*N,mean =coef(gh)[1],sd=coef(gh)[2] ), c(M,N))
sample <- apply(out,1,median)
hist(sample,xlab = "Median Sample",main="Median Samp Dist",col = "white",breaks = 50)
# Range of middle 95% of Samp Dist
quantile(sample,c(0.05/2, 1-0.05/2))
## 2.5% 97.5%
## 5.641251 5.805221
diff(quantile(sample,c(0.05/2, 1-0.05/2)))
## 97.5%
## 0.1639706
To generate the estimates of parameters of gamma distribution, I will use mean^2/var to calculate the shape and use var/mean to calculate the scale.
# Estimates of parameters
mm_ga_shape=mean(d1$gh)^2/var(d1$gh)
mm_ga_scale=var(d1$gh)/mean(d1$gh)
The shape is:
mm_ga_shape
## [1] 29.59754
The scale is:
mm_ga_scale
## [1] 0.1934147
Then, we could use these parameters to plot cdf and ecdf. To plot ecdf, I will use plot() and ecdf(). To draw the estimated CDF, we can use pgamma() to get the cdf and use curve() to draw its line. To overlay estimated CDF onto eCDF, we have to set add=TRUE in curve function. From the graphic, we can see the estimated cdf overlays the ecdf and they almost do not have a similar trend, which means the empirical CDF doesn’t match cdf well.
# Overlay estimated CDF onto eCDF
plot(ecdf(d1$gh),main = "Glycohemoglobin(MM)-Gamma",xlab="GH", ylab="CDF",pch = 5,col="grey")
curve(pgamma(x, shape=mm_ga_shape,scale=mm_ga_scale),add=TRUE,col="pink",lwd=3)
legend("topleft",c("eCDF","estimated CDF"), col = c("grey","pink"), lwd=2,bty = "n")
Next, we will use these parameters to plot histogram and estimated PDF. To plot histogram, I will use hist(). To draw the estimated PDF, we can use dgamma() to get the pdf and use curve() to draw its line. To overlay estimated PDF onto histogram, we have to set add=TRUE in curve function. From the graphic, We can see that estimated pdf also doesn’t match the pdf well.
# Overlay estimated pdf onto histogram
hist(d1$gh,main="Glycohemoglobin(MM)-Gamma",xlab="GH",breaks=100,freq=FALSE,col = "white")
curve(dgamma(x,shape=mm_ga_shape,scale=mm_ga_scale),add=TRUE,col="pink",lwd=2)
We can also create a QQ plot using these parameters. I will first use quantile() to get sample quantile and use qgamma() to get theoretical quantile. Then, I use plot() to create the QQ plot, x-axis is sample quantile and y-axis is theoretical quantile. From the graphic, we can see that plotted points nearly fall along the line y=x, which means that sample and theoretical quantiles come from the same distribution. The simulated data doesn’t match the theoretical distribution.
# QQ plot (sample vs estimated dist)
ps<-ppoints(1000)
sample_q=quantile(d1$gh,ps)
theo_q=qgamma(ps,shape=mm_ga_shape,scale=mm_ga_scale)
plot(theo_q,sample_q,pch=10,main="QQ Plot for gamma distribution",ylab="sample quantile",xlab="theoretical quantile",col="pink")
abline(0,1)
Plot simulated data on the y axis and the theoretical data on the x aixs. The plotted points do not fall along the line y=x well. Thus, the simulated data seems to not agree with the theoretical distribution.
Then, I will use qgamma() to generate the estimated median and make sure the argument p=0.5. The first one is analytical median, the second one is estimated median.
median(d1$gh)
## [1] 5.5
# Estimated Median
qgamma(.5,shape=mm_ga_shape,scale=mm_ga_scale)
## [1] 5.660259
I will use simulation method to generate the median sample where the sample size is 1000 and simulation times are 5000. Then, we can use hist() to draw a histogram for simulated medians.
# Median Samp Dist (hist)
N <- 1000
M <- 5000
out <- array(rgamma(M*N,shape = mm_ga_shape,scale = mm_ga_scale), c(M,N))
sample <- apply(out,1,median)
hist(sample,xlab = "Median",main="Median Samp Dist",breaks=50,col="white")
Finally, I will use quantile() to generate the range of middle 95% of sample distribution. We can see the results below:
# Range of middle 95% of Samp Dist
quantile(sample,c(0.05/2, 1-0.05/2))
## 2.5% 97.5%
## 5.577245 5.739108
diff(quantile(sample,c(0.05/2, 1-0.05/2)))
## 97.5%
## 0.1618631
In terms of maximum likelihood, we could see the results fit the gamma distribution using MM. However, the simulated distribution seems to not agree with the theoretical sampling distribution.
# Maximum likelihood
require(stats4)
nLL <- function(shape, scale){
fs <- dgamma(
x = d1$gh
, shape = shape
, scale = scale
, log = TRUE
)
-sum(fs)
}
gh <- mle(
nLL
, start = list(shape = 1, scale = 1)
, method = "L-BFGS-B"
, lower = c(0, 0.01)
)
par(mfrow = c(1,2)); plot(profile(gh), absVal = FALSE)
par(mfrow = c(1,2))
# Overlay estimated CDF onto eCDF
plot(ecdf(d1$gh), main = "",col="pink")
curve(
pgamma(x, shape=coef(gh)[1], scale=coef(gh)[2])
, add = TRUE
, col = "grey"
, lwd = 3
)
# Overlay estimated pdf onto histogram
hist(d1$gh, freq = FALSE, main = "",col="white")
curve(
dgamma(x, shape=coef(gh)[1], scale=coef(gh)[2])
, add = TRUE
, col = "pink"
, lwd = 3
)
# QQ plot (sample vs estimated dist)
ps<-ppoints(1000)
sample_q=quantile(d1$gh,ps)
theo_q=qgamma(ps,shape=coef(gh)[1],scale=coef(gh)[2])
plot(theo_q,sample_q,pch=10,main="QQ Plot for gamma distribution(MLE)",ylab="sample quantile",xlab="theoretical quantile",col="pink")
abline(0,1)
median(d1$gh)
## [1] 5.5
# Estimated Median
qgamma(.5,shape=coef(gh)[1],scale=coef(gh)[2])
## [1] 5.677983
# Median Samp Dist (hist)
N <- 1000
M <- 5000
out <- array(rgamma(M*N,shape = coef(gh)[1],scale = coef(gh)[2]), c(M,N))
sample <- apply(out,1,median)
hist(sample,xlab = "Median",main="Median Samp Dist",breaks=50,col="white")
# Range of middle 95% of Samp Dist
quantile(sample,c(0.05/2, 1-0.05/2))
## 2.5% 97.5%
## 5.606472 5.748611
diff(quantile(sample,c(0.05/2, 1-0.05/2)))
## 97.5%
## 0.1421387
To generate the estimates of parameters of weibull distribution, I will create functions to calculate the lambda and k.
# Estimates of parameters
lambda=function(samp_mean,k){
samp_mean/gamma(1+1/k)
}
var_weib=function(k,samp_mean,samp_var){
lambda(samp_mean,k)^2*(gamma(1+2/k)-(gamma(1+1/k))^2)-samp_var
}
plot(x=seq(10,40,by=0.1),y=var_weib(k=seq(10,40,by=0.1),samp_mean=mean(d1$gh),samp_var=var(d1$gh)),main="Weibull Distribution",xlab="k",ylab='var.weibull',col="pink")
mm_opt=optimize(f=function(x) {abs (var_weib ( k=x,samp_mean = mean(d1$ht),
samp_var = var(d1$gh))) },lower=10,upper=100)
mm_weib_k=mm_opt$minimum
mm_weib_lambda=lambda(samp_mean =mean(d1$gh),k=mm_weib_k )
Then, we could use these parameters to plot cdf and ecdf. To plot ecdf, I will use plot() and ecdf(). To draw the estimated CDF, we can use pweibull() to get the cdf and use curve() to draw its line. To overlay estimated CDF onto eCDF, we have to set add=TRUE in curve function. From the graphic, we can see the estimated cdf overlays the ecdf and they almost have the same trend, which means the empirical CDF does not match cdf well.
# Overlay estimated CDF onto eCDF
plot(ecdf(d1$gh),main = "Height of Adult Females(MM)-Weibull",xlab="GH", ylab="CDF",pch = 5,col="grey")
curve(pweibull(x,shape = mm_weib_k,scale = mm_weib_lambda),add = TRUE,col = "pink",lwd=3)
legend("topleft",c("eCDF","estimated CDF"), col = c("grey","pink"), lwd=2,bty = "n")
Next, we will use these parameters to plot histogram and estimated PDF. To plot histogram, I will use hist(). To draw the estimated PDF, we can use dweibull() to get the pdf and use curve() to draw its line. To overlay estimated PDF onto histogram, we have to set add=TRUE in curve function. From the graphic, We can see that estimated pdf does not match the pdf well.
# Overlay estimated pdf onto histogram
hist(d1$gh,main="Height of Adult Females(MM)-Weibull",xlab="Height",breaks=100,freq=FALSE,col="white")
curve(dweibull(x,shape=mm_weib_k,scale=mm_weib_lambda),add=TRUE,col="pink",lwd=3)
We can also create a QQ plot using these parameters. I will first use quantile() to get sample quantile and use qweibull() to get theoretical quantile. Then, I use plot() to create the QQ plot, x-axis is sample quantile and y-axis is theoretical quantile. From the graphic, we can see that plotted points nearly do not fall along the line y=x, which means that sample and theoretical quantiles do not come from the same distribution.
# QQ plot (sample vs estimated dist)
ps<-ppoints(1000)
sample_q=quantile(d1$gh,ps)
theo_q=qweibull(ps,shape=mm_weib_k,scale=mm_weib_lambda)
plot(theo_q,sample_q,pch=10,main="QQ Plot for weibull distribution",ylab="sample quantile",xlab="theoretical quantile",col="pink")
abline(0,1,col="grey")
Plot simulated data on the y axis and the theoretical data on the x aixs.
The plotted points do not fall along the line y=x well. Thus, the simulated data seems to not agree with the theoretical sampling distribution.
Then, I will use qweibull() to generate the estimated median and make sure the argument p=0.5. The first one is analytical median, the second one is estimated median.
median(d1$gh)
## [1] 5.5
# Estimated Median
qweibull(.5,shape=mm_weib_k,scale=mm_weib_lambda)
## [1] 5.736205
I will use simulation method to generate the median sample where the sample size is 1000 and simulation times are 5000. Then, we can use hist() to draw a histogram for simulated medians.
# Median Samp Dist (hist)
N <- 1000
M <- 5000
out <- array(rweibull(M*N,shape = mm_weib_k,scale = mm_weib_lambda), c(M,N))
sample <- apply(out,1,median)
hist(sample,xlab = "Theoretical quantile", ylab = "Sample quantile",main="Median Samp Dist",col="white")
Finally, I will use quantile() to generate the range of middle 95% of sample distribution. We can see the results below:
# Range of middle 95% of Samp Dist
quantile(sample,c(0.05/2, 1-0.05/2))
## 2.5% 97.5%
## 5.731038 5.741355
diff(quantile(sample,c(0.05/2, 1-0.05/2)))
## 97.5%
## 0.01031704
In terms of maximum likelihood, we could see the results fit the weibull distribution using MM. However, the simulated distribution still seems to not agree with the theoretical sampling distribution well. So, I don’t recommend you to use weibull distribution.
# Maximum likelihood
require(stats4)
nLL <- function(shape, scale){
fs <- dweibull(
x = d1$gh
, shape = shape
, scale = scale
, log = TRUE
)
-sum(fs)
}
gh <- mle(
nLL
, start = list(shape = 1, scale = 1)
, method = "L-BFGS-B"
, lower = c(0, 0.01)
)
par(mfrow = c(1,2)); plot(profile(gh), absVal = FALSE)
par(mfrow = c(1,2))
# Overlay estimated CDF onto eCDF
plot(ecdf(d1$gh), main = "",col="pink")
curve(
pweibull(x, shape=coef(gh)[1], scale=coef(gh)[2])
, add = TRUE
, col = "grey"
, lwd = 3
)
# Overlay estimated pdf onto histogram
hist(d1$gh, freq = FALSE, main = "",col="white")
curve(
dweibull(x, shape=coef(gh)[1], scale=coef(gh)[2])
, add = TRUE
, col = "pink"
, lwd = 3
)
# QQ plot (sample vs estimated dist)
ps<-ppoints(1000)
sample_q=quantile(d1$gh,ps)
theo_q=qweibull(ps,shape=coef(gh)[1],scale=coef(gh)[2])
plot(theo_q,sample_q,pch=10,main="QQ Plot for weibull distribution",ylab="sample quantile",xlab="theoretical quantile",col="pink")
abline(0,1)
median(d1$ht)
## [1] 160.8
# Estimated Median
qweibull(.5,shape=coef(gh)[1],scale=coef(gh)[2])
## [1] 5.64902
# Median Samp Dist (hist)
N <- 1000
M <- 5000
out <- array(rweibull(M*N,shape = coef(gh)[1],scale = coef(gh)[2]), c(M,N))
sample <- apply(out,1,median)
hist(sample,xlab = "Theoretical quantile", ylab = "Sample quantile",main="Median Samp Dist",breaks=50,col="white")
# Range of middle 95% of Samp Dist
quantile(sample,c(0.05/2, 1-0.05/2))
## 2.5% 97.5%
## 5.525208 5.772830
diff(quantile(sample,c(0.05/2, 1-0.05/2)))
## 97.5%
## 0.2476222
Take-home Message Based on the above graphic, I will not recommend you to use any distribution we have tried because the graphic for these distributions show that they do not match well, so in this situation, you’d better not using these distribution to simulate.
To generate the estimates of parameters of normal distribution, I will use mean() to calculate the mean value and use sd() to calculate the standard deviation.
# Estimates of parameters
mm_norm_mean=mean(d1$ht)
mm_norm_sd=sd(d1$ht)
mm_norm_mean
## [1] 160.7419
mm_norm_sd
## [1] 7.320161
Then, we could use these parameters to plot cdf and ecdf. To plot ecdf, I will use plot() and ecdf(). To draw the estimated CDF, we can use pnorm() to get the cdf and use curve() to draw its line. To overlay estimated CDF onto eCDF, we have to set add=TRUE in curve function. From the graphic, we can see the estimated cdf overlays the ecdf and they almost have the same trend, which means the empirical CDF is almost identical to cdf.
# Overlay estimated CDF onto eCDF
plot(ecdf(d1$ht),main = "Height of Adult Females(MM)",ylab="CDF",xlab="Height",pch = 5,col="grey")
curve(pnorm(x, mm_norm_mean,mm_norm_sd),add=TRUE,col="pink",lwd=3)
legend("topleft",c("eCDF","estimated CDF"), col = c("grey","pink"), lwd=2,bty = "n")
Next, we will use these parameters to plot histogram and estimated PDF. To plot histogram, I will use hist(). To draw the estimated PDF, we can use dnorm() to get the pdf and use curve() to draw its line. To overlay estimated PDF onto histogram, we have to set add=TRUE in curve function. From the graphic, We can see that estimated pdf matches the pdf well.
# Overlay estimated pdf onto histogram
hist(d1$ht,main="Height of Adult Females(MM)",xlab="Height",breaks=100,freq=FALSE,col="white")
curve(dnorm(x,mm_norm_mean,mm_norm_sd),add=TRUE,col="pink",lwd=2)
We can also create a QQ plot using these parameters. I will first use quantile() to get sample quantile and use qnorm() to get theoretical quantile. Then, I use plot() to create the QQ plot, x-axis is sample quantile and y-axis is theoretical quantile. From the graphic, we can see that plotted points nearlt fall along the line y=x, which means that sample and theoretical quantiles come from the same distribution. The simulated data matches the theoretical sampling distribution.
# QQ plot (sample vs estimated dist)
ps<-ppoints(1000)
sample_q=quantile(d1$ht,ps)
theo_q=qnorm(ps,mean=mm_norm_mean,sd=mm_norm_sd)
plot(theo_q,sample_q,pch=10,main="QQ Plot for normal distribution",ylab="sample quantile",xlab="theoretical quantile",col="pink")
abline(0,1)
Plot simulated data on the y axis and the theoretical data on the x aixs. The plotted points fall along the line y=x well. Thus, the simulated data seems to agree with the theoretical sampling distribution.
Then, I will use qnorm() to generate the estimated median and make sure the argument p=0.5. The first one is analytical median, the second one is estimated median.
median(d1$ht)
## [1] 160.8
# Estimated Median
qnorm(.5,mean = mm_norm_mean,sd=mm_norm_sd)
## [1] 160.7419
I will use simulation method to generate the median sample where the sample size is 1000 and simulation times are 5000. Then, we can use hist() to draw a histogram for simulated medians.
# Median Samp Dist (hist)
N <- 1000
M <- 5000
out <- array(rnorm(M*N,mean =mm_norm_mean,sd=mm_norm_sd ), c(M,N))
sample <- apply(out,1,median)
hist(sample,xlab = "Median Sample",main="Median Samp Dist",col = "white",breaks = 50)
Finally, I will use quantile() to generate the range of middle 95% of sample distribution. We can see the results below:
# Range of middle 95% of Samp Dist
quantile(sample,c(0.05/2, 1-0.05/2))
## 2.5% 97.5%
## 160.1642 161.3139
diff(quantile(sample,c(0.05/2, 1-0.05/2)))
## 97.5%
## 1.149753
In terms of maximum likelihood, we could see the results fit the normal distribution using MM. And the simulated distribution seems to agree with the theoretical sampling distribution.
# Maximum likelihood
require(stats4)
nLL <- function(mean, sd){
fs <- dnorm(
x = d1$ht
, mean = mean
, sd = sd
, log = TRUE
)
-sum(fs)
}
ht <- mle(
nLL
, start = list(mean = 1, sd = 1)
, method = "L-BFGS-B"
, lower = c(0, 0.01)
)
par(mfrow = c(1,2)); plot(profile(ht), absVal = FALSE)
par(mfrow = c(1,2))
# Overlay estimated CDF onto eCDF
plot(ecdf(d1$ht), main = "",col="pink",xlab="Height")
curve(
pnorm(x, coef(ht)[1], coef(ht)[2])
, add = TRUE
, col = "grey"
, lwd = 3
)
# Overlay estimated pdf onto histogram
hist(d1$ht, freq = FALSE, main = "",col="white",xlab="Height")
curve(
dnorm(x, coef(ht)[1], coef(ht)[2])
, add = TRUE
, col = "pink"
, lwd = 3
)
# QQ plot (sample vs estimated dist)
ps<-ppoints(1000)
sample_q=quantile(d1$ht,ps)
theo_q=qnorm(ps,mean=coef(ht)[1],sd=coef(ht)[2])
plot(theo_q,sample_q,pch=10,main="QQ Plot for normal distribution(MLE)",ylab="sample quantile",xlab="theoretical quantile",col="pink")
abline(0,1,col="grey")
median(d1$ht)
## [1] 160.8
# Estimated Median
qnorm(.5,mean = coef(ht)[1],sd=coef(ht)[2])
## [1] 160.7419
# Median Samp Dist (hist)
N <- 1000
M <- 5000
out <- array(rnorm(M*N,mean =coef(ht)[1],sd=coef(ht)[2] ), c(M,N))
sample <- apply(out,1,median)
hist(sample,xlab = "Median Sample",main="Median Samp Dist",col = "white",breaks = 50)
# Range of middle 95% of Samp Dist
quantile(sample,c(0.05/2, 1-0.05/2))
## 2.5% 97.5%
## 160.1713 161.3152
diff(quantile(sample,c(0.05/2, 1-0.05/2)))
## 97.5%
## 1.143869
To generate the estimates of parameters of gamma distribution, I will use mean^2/var to calculate the shape and use var/mean to calculate the scale.
# Estimates of parameters
mm_ga_shape=mean(d1$ht)^2/var(d1$ht)
mm_ga_scale=var(d1$ht)/mean(d1$ht)
The shape is:
mm_ga_shape
## [1] 482.1886
The scale is:
mm_ga_scale
## [1] 0.333359
Then, we could use these parameters to plot cdf and ecdf. To plot ecdf, I will use plot() and ecdf(). To draw the estimated CDF, we can use pgamma() to get the cdf and use curve() to draw its line. To overlay estimated CDF onto eCDF, we have to set add=TRUE in curve function. From the graphic, we can see the estimated cdf overlays the ecdf and they almost have the same trend, which means the empirical CDF is almost identical to cdf.
# Overlay estimated CDF onto eCDF
plot(ecdf(d1$ht),main = "Height of Adult Females(MM)-Gamma",xlab="Height", ylab="CDF",pch = 5,col="grey")
curve(pgamma(x, shape=mm_ga_shape,scale=mm_ga_scale),add=TRUE,col="pink",lwd=3)
legend("topleft",c("eCDF","estimated CDF"), col = c("grey","pink"), lwd=2,bty = "n")
Next, we will use these parameters to plot histogram and estimated PDF. To plot histogram, I will use hist(). To draw the estimated PDF, we can use dgamma() to get the pdf and use curve() to draw its line. To overlay estimated PDF onto histogram, we have to set add=TRUE in curve function. From the graphic, We can see that estimated pdf matches the pdf well.
# Overlay estimated pdf onto histogram
hist(d1$ht,main="Height of Adult Females(MM)-Gamma",xlab="Height",breaks=100,freq=FALSE,col = "white")
curve(dgamma(x,shape=mm_ga_shape,scale=mm_ga_scale),add=TRUE,col="pink",lwd=2)
We can also create a QQ plot using these parameters. I will first use quantile() to get sample quantile and use qgamma() to get theoretical quantile. Then, I use plot() to create the QQ plot, x-axis is sample quantile and y-axis is theoretical quantile. From the graphic, we can see that plotted points nearly fall along the line y=x, which means that sample and theoretical quantiles come from the same distribution. The simulated data matches the theoretical sampling distribution.
# QQ plot (sample vs estimated dist)
ps<-ppoints(1000)
sample_q=quantile(d1$ht,ps)
theo_q=qgamma(ps,shape=mm_ga_shape,scale=mm_ga_scale)
plot(theo_q,sample_q,pch=10,main="QQ Plot for gamma distribution",ylab="sample quantile",xlab="theoretical quantile",col="pink")
abline(0,1)
Plot simulated data on the y axis and the theoretical data on the x aixs. The plotted points fall along the line y=x well. Thus, the simulated data seems to agree with the theoretical sampling distribution.
Then, I will use qgamma() to generate the estimated median and make sure the argument p=0.5. The first one is analytical median, the second one is estimated median.
median(d1$ht)
## [1] 160.8
# Estimated Median
qgamma(.5,shape=mm_ga_shape,scale=mm_ga_scale)
## [1] 160.6308
I will use simulation method to generate the median sample where the sample size is 1000 and simulation times are 5000. Then, we can use hist() to draw a histogram for simulated medians.
# Median Samp Dist (hist)
N <- 1000
M <- 5000
out <- array(rgamma(M*N,shape = mm_ga_shape,scale = mm_ga_scale), c(M,N))
sample <- apply(out,1,median)
hist(sample,xlab = "Theoretical quantile", ylab = "Sample quantile",main="Median Samp Dist",breaks=50,col="white")
Finally, I will use quantile() to generate the range of middle 95% of sample distribution. We can see the results below:
# Range of middle 95% of Samp Dist
quantile(sample,c(0.05/2, 1-0.05/2))
## 2.5% 97.5%
## 160.0664 161.1983
diff(quantile(sample,c(0.05/2, 1-0.05/2)))
## 97.5%
## 1.131839
In terms of maximum likelihood, we could see the results fit the gamma distribution using MM. And the simulated distribution seems to agree with the theoretical sampling distribution.
# Maximum likelihood
require(stats4)
nLL <- function(shape, scale){
fs <- dgamma(
x = d1$ht
, shape = shape
, scale = scale
, log = TRUE
)
-sum(fs)
}
ht <- mle(
nLL
, start = list(shape = 1, scale = 1)
, method = "L-BFGS-B"
, lower = c(0, 0.01)
)
par(mfrow = c(1,2)); plot(profile(ht), absVal = FALSE)
par(mfrow = c(1,2))
# Overlay estimated CDF onto eCDF
plot(ecdf(d1$ht), main = "",col="pink")
curve(
pgamma(x, shape=coef(ht)[1], scale=coef(ht)[2])
, add = TRUE
, col = "grey"
, lwd = 3
)
# Overlay estimated pdf onto histogram
hist(d1$ht, freq = FALSE, main = "",col="white")
curve(
dgamma(x, shape=coef(ht)[1], scale=coef(ht)[2])
, add = TRUE
, col = "pink"
, lwd = 3
)
# QQ plot (sample vs estimated dist)
ps<-ppoints(1000)
sample_q=quantile(d1$ht,ps)
theo_q=qgamma(ps,shape=coef(ht)[1],scale=coef(ht)[2])
plot(theo_q,sample_q,pch=10,main="QQ Plot for gamma distribution(MLE)",ylab="sample quantile",xlab="theoretical quantile",col="pink")
abline(0,1)
median(d1$ht)
## [1] 160.8
# Estimated Median
qgamma(.5,shape=coef(ht)[1],scale=coef(ht)[2])
## [1] 160.6312
# Median Samp Dist (hist)
N <- 1000
M <- 5000
out <- array(rgamma(M*N,shape = coef(ht)[1],scale = coef(ht)[2]), c(M,N))
sample <- apply(out,1,median)
hist(sample,xlab = "Theoretical quantile", ylab = "Sample quantile",main="Median Samp Dist",breaks=50,col="white")
# Range of middle 95% of Samp Dist
quantile(sample,c(0.05/2, 1-0.05/2))
## 2.5% 97.5%
## 160.0630 161.2177
diff(quantile(sample,c(0.05/2, 1-0.05/2)))
## 97.5%
## 1.154712
To generate the estimates of parameters of weibull distribution, I will create functions to calculate the lambda and k.
# Estimates of parameters
lambda=function(samp_mean,k){
samp_mean/gamma(1+1/k)
}
var_weib=function(k,samp_mean,samp_var){
lambda(samp_mean,k)^2*(gamma(1+2/k)-(gamma(1+1/k))^2)-samp_var
}
plot(x=seq(10,40,by=0.1),y=var_weib(k=seq(10,40,by=0.1),samp_mean=mean(d1$ht),samp_var=var(d1$ht)),main="Weibull Distribution",xlab="k",ylab='var.weibull',col="pink")
mm_opt=optimize(f=function(x) {abs (var_weib ( k=x,samp_mean = mean(d1$ht),
samp_var = var(d1$ht))) },lower=10,upper=100)
mm_weib_k=mm_opt$minimum
mm_weib_lambda=lambda(samp_mean =mean(d1$ht),k=mm_weib_k )
Then, we could use these parameters to plot cdf and ecdf. To plot ecdf, I will use plot() and ecdf(). To draw the estimated CDF, we can use pweibull() to get the cdf and use curve() to draw its line. To overlay estimated CDF onto eCDF, we have to set add=TRUE in curve function. From the graphic, we can see the estimated cdf overlays the ecdf and they almost have the same trend, which means the empirical CDF does not match cdf well.
# Overlay estimated CDF onto eCDF
plot(ecdf(d1$ht),main = "Height of Adult Females(MM)-Weibull",xlab="Height", ylab="CDF",pch = 5,col="grey")
curve(pweibull(x,shape = mm_weib_k,scale = mm_weib_lambda),add = TRUE,col = "pink",lwd=3)
legend("topleft",c("eCDF","estimated CDF"), col = c("grey","pink"), lwd=2,bty = "n")
Next, we will use these parameters to plot histogram and estimated PDF. To plot histogram, I will use hist(). To draw the estimated PDF, we can use dweibull() to get the pdf and use curve() to draw its line. To overlay estimated PDF onto histogram, we have to set add=TRUE in curve function. From the graphic, We can see that estimated pdf does not match the pdf well.
# Overlay estimated pdf onto histogram
hist(d1$ht,main="Height of Adult Females(MM)-Weibull",xlab="Height",breaks=100,freq=FALSE,col="white")
curve(dweibull(x,shape=mm_weib_k,scale=mm_weib_lambda),add=TRUE,col="pink",lwd=3)
We can also create a QQ plot using these parameters. I will first use quantile() to get sample quantile and use qweibull() to get theoretical quantile. Then, I use plot() to create the QQ plot, x-axis is sample quantile and y-axis is theoretical quantile. From the graphic, we can see that plotted points nearly fall along the line y=x, which means that sample and theoretical quantiles come from the same distribution.
# QQ plot (sample vs estimated dist)
ps<-ppoints(1000)
sample_q=quantile(d1$ht,ps)
theo_q=qweibull(ps,shape=mm_weib_k,scale=mm_weib_lambda)
plot(theo_q,sample_q,pch=10,main="QQ Plot for weibull distribution",ylab="sample quantile",xlab="theoretical quantile",col="pink")
abline(0,1)
Plot simulated data on the y axis and the theoretical data on the x aixs.
The plotted points do not fall along the line y=x well. Thus, the simulated data seems to not agree with the theoretical sampling distribution.
Then, I will use qweibull() to generate the estimated median and make sure the argument p=0.5. The first one is analytical median, the second one is estimated median.
median(d1$ht)
## [1] 160.8
# Estimated Median
qweibull(.5,shape=mm_weib_k,scale=mm_weib_lambda)
## [1] 161.8065
I will use simulation method to generate the median sample where the sample size is 1000 and simulation times are 5000. Then, we can use hist() to draw a histogram for simulated medians.
# Median Samp Dist (hist)
N <- 1000
M <- 5000
out <- array(rweibull(M*N,shape = mm_weib_k,scale = mm_weib_lambda), c(M,N))
sample <- apply(out,1,median)
hist(sample,xlab = "Theoretical quantile", ylab = "Sample quantile",main="Median Samp Dist",col="white")
Finally, I will use quantile() to generate the range of middle 95% of sample distribution. We can see the results below:
# Range of middle 95% of Samp Dist
quantile(sample,c(0.05/2, 1-0.05/2))
## 2.5% 97.5%
## 161.2805 162.3187
diff(quantile(sample,c(0.05/2, 1-0.05/2)))
## 97.5%
## 1.038174
In terms of maximum likelihood, we could see the results fit the weibull distribution using MM. However, the simulated distribution still seems to not agree with the theoretical sampling distribution well. So, I don’t recommend you to use weibull distribution.
# Maximum likelihood
require(stats4)
nLL <- function(shape, scale){
fs <- dweibull(
x = d1$ht
, shape = shape
, scale = scale
, log = TRUE
)
-sum(fs)
}
ht <- mle(
nLL
, start = list(shape = 1, scale = 1)
, method = "L-BFGS-B"
, lower = c(0, 0.01)
)
par(mfrow = c(1,2)); plot(profile(ht), absVal = FALSE)
par(mfrow = c(1,2))
# Overlay estimated CDF onto eCDF
plot(ecdf(d1$ht), main = "",col="pink")
curve(
pweibull(x, shape=coef(ht)[1], scale=coef(ht)[2])
, add = TRUE
, col = "grey"
, lwd = 3
)
# Overlay estimated pdf onto histogram
hist(d1$ht, freq = FALSE, main = "",col="white")
curve(
dweibull(x, shape=coef(ht)[1], scale=coef(ht)[2])
, add = TRUE
, col = "pink"
, lwd = 3
)
# QQ plot (sample vs estimated dist)
ps<-ppoints(1000)
sample_q=quantile(d1$ht,ps)
theo_q=qweibull(ps,shape=coef(ht)[1],scale=coef(ht)[2])
plot(theo_q,sample_q,pch=10,main="QQ Plot for weibull distribution",ylab="sample quantile",xlab="theoretical quantile",col="pink")
abline(0,1)
median(d1$ht)
## [1] 160.8
# Estimated Median
qweibull(.5,shape=coef(ht)[1],scale=coef(ht)[2])
## [1] 161.5156
# Median Samp Dist (hist)
N <- 1000
M <- 5000
out <- array(rweibull(M*N,shape = coef(ht)[1],scale = coef(ht)[2]), c(M,N))
sample <- apply(out,1,median)
hist(sample,xlab = "Theoretical quantile", ylab = "Sample quantile",main="Median Samp Dist",breaks=50,col="white")
# Range of middle 95% of Samp Dist
quantile(sample,c(0.05/2, 1-0.05/2))
## 2.5% 97.5%
## 160.8538 162.1669
diff(quantile(sample,c(0.05/2, 1-0.05/2)))
## 97.5%
## 1.313033
Take-home Message Based on the above graphic, I will really recommend you to use gamma distribution and normal distribution because their sample distribution matches the analytical distribution well no matter you use MM method or MLE method compared with weibull distribution. However, the graphic for weibull distribution show that they do not match well, so in this situation, you’d better not using weibull distribution.